A course in Data Science

Not including…

  • …machine learning…

  • …Big Data architectures…

  • …formal statistical inference…

The Data Science Venn Diagram

Statistical Data Processing

Reproducibility

Copy, Paste, Click

  • As a statistician/actuary/mathematician you will be writing lots of reports.
  • It is tempting to create tables and figures by copy, paste, click in and between Excel sheets.
  • Example: Give me a figure of the monthly consumer price index for 2017
    https://www.google.se/search?q=kpi+scb

Problems

  • "Interesting decline at the end, but what month is 13?"
  • "Nice, but I'd rather have a table?"
  • Next year a colleague (or yourself) asks how you did it.

Reproducible data analysis

Reproducibility is the ability to get the same research results or inferences, based on the raw data and computer programs provided by researchers.

Cf. replicability: being able to reproduce the result in an independent experiment.

Reproducible data analysis

  • Everything written in code (no copy-paste of values/tables/figures)

  • Portable (the code should execute, not only on your computer today)

  • Accessible (available openly)

  • Fully automated from raw-data to report
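
A minimal sketch of what these points can look like in one R script. The built-in mtcars data set stands in for a raw data file, and the output file name mpg_by_cyl.csv is a hypothetical choice for illustration; it is written to a temporary directory so that the script does not depend on any particular computer.

```r
# Raw data: the built-in mtcars data set stands in for a raw data file.
raw <- mtcars

# Derived table: mean fuel consumption (mpg) per number of cylinders,
# computed in code instead of copied by hand.
mpg_by_cyl <- aggregate(mpg ~ cyl, data = raw, FUN = mean)

# Output: written to a path built with file.path(), no hard-coded
# user-specific folder. The file name is hypothetical.
out_file <- file.path(tempdir(), "mpg_by_cyl.csv")
write.csv(mpg_by_cyl, out_file, row.names = FALSE)

print(mpg_by_cyl)
```

Rerunning the script regenerates the table from the raw data, so a request such as "I'd rather have a table" or "how did you do it?" is answered by the code itself.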

Tools for reproducible data analysis

Code: R or Python?

Code: R or Python!

R: Too many ways to do the same thing…

summary(mtcars$mpg)
summary(mtcars$"mpg")
summary(mtcars[, "mpg"])
summary(mtcars["mpg"])
summary(mtcars[["mpg"]])
summary(mtcars[1])
summary(mtcars[, 1])
summary(mtcars[[1]])
with(mtcars, summary(mpg))
attach(mtcars); summary(mpg)
summary(subset(mtcars, select=mpg))

From http://r4stats.com/articles/why-r-is-hard-to-learn/
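
The calls above do not all hand the same kind of object to summary(). A small sketch, using only base R, that groups some of them by what they extract: a numeric vector or a one-column data frame.

```r
# Forms that extract the column as a plain numeric vector.
as_vector <- list(
  mtcars$mpg,
  mtcars[, "mpg"],
  mtcars[["mpg"]],
  mtcars[, 1],
  mtcars[[1]]
)

# Forms that keep a one-column data frame.
as_frame <- list(
  mtcars["mpg"],
  mtcars[1],
  subset(mtcars, select = mpg)
)

# All vector forms give exactly the same object ...
all_vectors_identical <- all(
  vapply(as_vector, identical, logical(1), y = mtcars$mpg)
)

# ... while the other forms are data frames, which summary()
# prints in a different layout.
all_frames <- all(vapply(as_frame, is.data.frame, logical(1)))

print(c(all_vectors_identical, all_frames))
```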

Code: Tidyverse (formerly nicknamed the Hadleyverse)

A suite of R packages heavily influenced by Hadley Wickham at RStudio. It is the focus of this course.
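
As an illustration of the style, here is a minimal sketch using dplyr, one of the Tidyverse packages, assuming it is installed. The column names mean_mpg and n_cars are chosen for this example only.

```r
library(dplyr)

# Mean fuel consumption and number of cars per cylinder count,
# written as a pipeline read from top to bottom.
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n_cars = n())

print(mpg_by_cyl)
```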

Automatic report generation

Automatic report generation: Markdown

Automatic report generation: R Markdown

Adds executable code to Markdown.

knitr: .Rmd → .md
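
A small sketch of the knitr step, assuming the knitr package is installed. The R Markdown source is given here as a character vector with made-up content rather than as a .Rmd file, so that the example is self-contained; knit() runs the code chunk and the inline code and returns the resulting Markdown as text.

```r
library(knitr)

# A tiny R Markdown document (hypothetical content): ordinary Markdown
# text, one R code chunk and one piece of inline R code.
rmd <- c(
  "# Fuel consumption",
  "",
  "```{r}",
  "summary(mtcars$mpg)",
  "```",
  "",
  "The data set contains `r nrow(mtcars)` cars."
)

# knit() executes the code and inserts the results into the Markdown.
md <- knit(text = rmd, quiet = TRUE)
cat(md)
```

In practice the source would live in a .Rmd file, and the same step runs when the document is knitted from RStudio.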

Accessibility (and version control): GitHub and Git

Git is software for version control; GitHub is a web-based hosting service for Git repositories.

Accessibility: GitHub

Version control

Image from http://phdcomics.com/comics/archive.php?comicid=1531

Not strictly necessary for reproducibility, but important for large projects involving multiple coders. In this course it is mainly a side effect of using GitHub for publication; version control is covered in more depth in Computer Science for Mathematicians (DA3018).

All of this is well integrated into RStudio

Also .Rproj for increased portability.
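
A short sketch of why a project file helps, assuming the usual RStudio behaviour that opening a .Rproj file sets the working directory to the project folder. The folder and file names (data, cpi.csv) and the absolute path are hypothetical.

```r
# Hypothetical layout: <project folder>/data/cpi.csv

# Portable: a path relative to the project folder.
rel_path <- file.path("data", "cpi.csv")

# Not portable: an absolute path that only exists on one computer.
abs_path <- "C:/Users/me/Documents/course/data/cpi.csv"

# With the project open, the relative path is resolved against the
# working directory, wherever the project folder happens to be.
full_path <- file.path(getwd(), rel_path)
print(full_path)
```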

Summary

  • Everything written in code: R

  • Portable: .Rproj (RStudio)

  • Accessible: GitHub

  • Automated: R Markdown

This course

Textbook

Exercises at DataCamp

They give basic training and preparation for class activities. They are not part of the examination.

How do I use X to produce Y?

Course structure

  • Preparation: Exercises at DataCamp before class.
  • In class: Mainly short reviews/lectures followed by supervised programming exercises.
  • Examination:
    • 6 homework problems.
    • Digital exam.
    • Individual project.

Homework problems (3 hp, Pass/Fail)

  • Six sets of problems, with deadlines on the coming six Sundays.
  • To be solved individually.
  • Suspicions of plagiarism (e.g. copying another student's code) will be reported.
  • Peer review.
  • Missed deadline/fail: Re-examination in February.
  • Individual homeworks are not valid for the next course offering (autumn 2019).

Digital exam (1.5 hp, A-F, 21/12)

  • Problem solving in an RStudio environment.
  • Tools allowed: Relevant Cheatsheets from RStudio.

Project (3 hp, A-F, oral presentation and deadline 15/1)

  • A data blog post.
  • Illustrate an issue based on a unique data set.
  • Short (5 min) oral presentation.

Guest lecture

19/11: Sebastian Tengborg, Data scientist at